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Abstract 

New methods are introduced for improving the performance of the 
vectorized Monte Carlo SU(3) lattice gauge theory algorithm using the 
CDC CY8ER 205. Structure, algorithm and programming considerations are discussed. 
The performance achieved for a 16^ lattice on a 2-pipe system may be phrased 
in terms of the link update time or overall MFLOPS rates. For 32-bit arithmetic 
it is 36.3 usee/ link for 8 hits per iteration (40.9 usee for 10 hits) or 
101 .5 MFLOPS. 
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1. Introduction 


Many important results for quantum field theories in general and, in 
particular, for the gauge theory of strong interactions known as Quantum 
Chromodynamics (QCD) have been obtained by formulating the dynamics on a 
space-time lattice. The lattice version of a quantized gauge field theory, 
as proposed by Wilson [1], has the properties of introducing an ultraviolet 
cut-off independently of any perturbative expansion and of preserving 
manifest gauge invariance. It permits a variety of investigations by 
non-perturbati ve techniques, strong-coupling expansions [2] and Monte 
Carlo (MC) simulations [3] being the most notable ones. Monte Carlo 
simulations, indeed, have probably produced the most important results 
for QCD, being able to probe the structure of the theory in the domain 
where the transition between the strong-coupling behavior at large distances 
and the asymptotical ly-free behavior at small separation takes place. 

Numerical methods must be used to explore the vary crucial domain 
of intermediate couplings, since there are no known analytical techniques 
for solving or even efficiently approximating gauge theories throughout 
that region. On the other hand the fact that quantum fluctuations on 


a finite lattice extending for n sites in four dimensions are given 

4 , 

by integrals of a dimensionality 4n n^ (n^ is the number of independent 
parameters in group space), which can easily exceed 2,000,000, leaves 


importance sampling, i.e. Monte Carlo simulations, as the only cal cul ational 


possibil ity . 


Monte Carlo calculations are of a numerical nature, and quite 


demanding on computational resources. The simulation of a system with 
SU ( 3 ) gauge group (i.e. the system underlying QCD) on a lattice extending 
for n sites in each of the four space-time dimensions requires storage 
of 4n^ link variables, i.e. 4n 4 SU(3) matrices, and the systematic 
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replacement, or "upgrading", of each of these matrices with new, updated 
values, for several hundred or several thousand sweeps of the whole 
lattice. One MC iteration is defined as a sweep of the lattice, i.e., 
one upgrade per link variable. A computation involving M MC iterations 
thus implies 4Mn^ individual upgrades of SU(3) matrices. The upgrading 
of each SU ( 3 ) matrix requires approximately 4,150 elementary arithmetic 
operations and 180 table look-ups (if 10 attempts at changing the link 
variable are made for each upgrade). For a lattice large enough for 
obtaining physically meaningful results, the amount of computation needed 
for a Monte Carlo simulation of QCD becomes extremely high. 

Because of the aforementioned difficulties, Monte Carlo simulations 

of QCD have been generally limited to lattices of rather small extent, 

a lattice of 8^ sites already representing a large lattice with respect 

to the scale of most calculations. On the other hand, with the progress 

in the field, it has become apparent that one must definitely analyze 

larger systems to develop confidence in the numerical results. This need 

may be understood on physical grounds. If 2 GeV is considered as a 

universal energy for the effects of asymptotic freedom to begin manifesting 

themselves, one would like the lattice spacing to be smaller than (2GeV)“^ 

(and the correspondi ng ultraviolet cut-off larger than 2GeV) i.e. smaller 

than 0.1 fm. Conversely, if the goal of the computations is to determine 

hadronic structure, the extent of the lattice should be larger than the 

typical size of a hadron. Taking this size to be (minimally) 1 fm, 

it becomes apparent that the parameter n ought to be larger, if possible 

substantially larger, than 10. With, e.g., n = 16 and M = 1000 the 

1 2 

calculation of a MC simulation requires more than 10 operations not a 
small task even for the largest machines currently available. 
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■■min ii 


The number of the data elements involved, and the amount of 
computations needed for manipulating this data, makes it worth while 
to investigate ways for vectorization of the code. 

The purpose of this article is to illustrate the vectorization and 
implementation on the CDC CYBER 205 of a code for Monte Carlo simulations 
of the SU(3) lattice gauge theory. (For previous implementations of 
vectorized code see Ref. 4.) As will be discussed in more detail in 
the final section of this paper, the characteristics and performance are 
such that 1 MC iteration of a 16 2 * 4 lattice can be done in 10.72 seconds 
(corresponding to an upgrade time of 40.9 usee per SU(3) link variable). 
Thus, 16^ and larger lattices can be considered for meaningful 
simulations of QCD. While we describe in this article the program for 
the basic Monte Carlo algorithm, we are currently using it, together 
with other vectorized codes, for a reevaluation on a large lattice, of 
several quantities of theoretical and phenomenological interest in QCD. 
The results of these investigations will be presented separately [5], 

Here we proceed with a description of the computational algorithm and 
an outline of its vectorization in Sect. 2, with a more detailed 
account of the program in Sect. 3 and a summary of performance data in 
Sect. 4. 

2. The Monte Carlo Algorithm 

We consider a hypercubical lattice of n s sites in each of the 

three spatial directions and n t sites in the temporal one. The 
dynamical variables of the SU ( 3 ) gauge theory are 3x3 unitary- 
unimodular complex matrices, which are associated with the 4n s n t links 
of the lattice. We denote by the matrix associated with the 
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oriented link coming out of the lattice site of (integer) coordinates 
X = ( x-j ,x 2 ,x 3l x 4 ) in the direction y (y=l ,2,3,4). The goal of the 
Monte Carlo algorithm is to produce a stochastic sequence of configurations 
of the system , (a configuration being defined as the collection 

of all U^), such that the probability P(C) of encountering any 
configuration C in the sequence approaches, after a reasonable 
equi 1 ibriation time, the distribution 

P(C) a exp{-S(C)} , (2.1) 

where S is the action of the configuration C in that sequence. S 
is given by a sum over plaquette variables p , a plaquette being an 
oriented square of the lattice defined by the origin x and two directions 
y and v : 


S 



6 1(1 - i Re Tr U ) , 

p 0 r 


( 2 . 2 ) 


where 


U 



= U U^'-U -IT 
x x+v x+y x 


(2.3) 


6 is the coupling parameter and y , v stand for unit lattice 
vectors in the u and v directions, respecti vely. When Eqn. 2.1 is satisfied, 
quantum mechanical expectation values of observables (0, defined rigorously 
as averages over all possible configurations, namely 
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<&> * z' 1 [ n du“)5(u)exp[-s(u)] 


(2.4) 


wi th 


Z - f( n l£)exp[-S(U)] . (2.5) 

J X ^11 


can be approximated by averages taken over the configurations generated 
by the Monte Carlo algorithm: 

i=N 0 +N 

<&> ~ A 2 e'(C (i) ) . (2.6) 

1=N 0 +1 


Nq represents the number of initial configurations discarded in order 
to allow for the stochastic sequence to reach equilibrium. 

In our code we implement the MC algorithm following the method of 
Metropolis et al [6]. Each individual dynamical variable is 
replaced by a new one according to the following procedure: 

i) a new candidate matrix is obtained from l/JJ by group multiplication 


where is an SU( 3 ) matrix randomly selected from a prepared set 

{ R 1 , . . . , } of M matrices, to be discussed later. 
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ii) the change in action, AS induced by the variation IT ■+ u u 

X X 

is calculated: 

as = s(ujj' (2.7) 

iii) a pseudorandom number r with uniform distribution between 0 
and 1 is generated and 

U x = U x if r < ex P(" AS ) » 

U u = U u otherwise, 
x x 

The steps i) to iii) define what is called a "hit" on one of the link 
variables. These steps are repeated N h (number of hits) times. This 
completes the upgrading of one (link) variable . One MC iteration 

(or one sweep of the lattice) is executed when all the variables have 
been processed in this manner. 

A crucial consideration for the whole algorithm and also for its 
vectorization is that the calculation of the variation of the action 
AS involves only a few of the dynamical variables apart from 
itself, namely those defined on the remaining links of the six 
plaquettes which share the link between x and x+u . It is 
convenient to be slightly detailed at this point and to introduce some 
terminology. Given the link from x to x+u there are three 
"forward" plaquettes incident on it, namely those with vertices 

x, x+u, x+u+v and x+v , 
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(v taking the three values f y) and three "backward" plaquettes, 
namely those with vertices 

x, x+y, x+y-v and x-v , 


(see Fig. 1 ) . 

We shall define as the "force" acting on U y the sum of the expressions 


-yv 

f,x 


= U y+ 


u 


x+y x+v 


( 2 . 8 ) 


(corresponding to the forward plaquettes) and 



= U v ~ -u y -u 

x+y-v x-v 


vt^ 

x-v 


(2.9) 


(corresponding to the backward plaquettes) over the three values of 
v f y 


F x = 2 (F f V x + F b V x } ' (2.10) 

x vj^y T,x D,x 

One can easily convince oneself that of the terms contributing to the 
action in Eqn. 2.2 all those containing U y can be written in the form 

B[1 - -j ReTr(F^)] , (2.11) 


and therefore 



( 2 . 12 ) 


AS = - | ReTr[F“ f (U“'- ujj)] . 

Thus, we become aware of two fundamental facts: 

i) once the force is calculated, the subsequent hits on 
the link variable can be done without any further recourse to the 
values of other U variables. 

ii) several upgradings can be done in parallel, provided only that the 

forces required for the computation do not involve any of the U 3 4 

X X 

variables that are simultaneously upgraded. 

While point i) is relevant for any MC simulation, point ii) acquires 
particular importance if one wants to write a vectorized code. Indeed, 
as we shall show, all variables with fixed y can be separated into 

two sets such that the forces for one set only involve elements of the other. 
Then, all the variables belonging to one set can be grouped together 

in an array and upgraded simultaneously. Finally one proceeds to upgrade 
the elements of the other set (the red-black or checkerboard algorithm 
[4]). We will see in the next section that the ability to separate 
the link variables into two independent sets is a key to efficient vector- 
i zation. 

3. The Vectorized Implementation of the Algorithm 

The previous discussion has demonstrated that Monte Carlo lattice 
gauge theories are worthy candidates for vector processing. Until recently, 
however, people were doubtful as to whether the vector capabilities of current 
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supercomputers can be effectively utilized for such applications. The 
main source for this skepticism is the inherent conflict between random 
access to data, an integral part of a Monte Carlo process, and the strict 
order of data elements required for pipelined computations. In other words, 
unless data can be "gathered" at rates comparable to computation rates no 
efficient vectorization can be achieved. 

One of the major strengths of the CDC CYBER 205, and what makes it 
a particularly powerful Monte Carlo machine, is the ability to order a 
random collection of data by means of a vector instruction, namely, the 
"Gather" instruction. This instruction is equivalent to a series of 
random, or, indirect "load" operations on a serial computer. The 
Gather instruction uses a vector of integers as an "index-list" 
pointing to the elements to be fetched. These elements are stored 
in the order they have been encountered into an output vector. The 
result rate for the Gather operation is one element every 1.25 
cycles (a cycle, or clock-period on the CDC CYBER 205 is 20 nonoseconds) . 

For a comparison, note that the floating-point arithmetic rate, excluding 
division, is one element every cycle per pipe for 64-bit operands. The 
CYBER 205 hardware also supports 32-bit operations with twice the 
result rate for vector floating-point operations. For example, on a 
two pipe machine 32-bit arithmetic is performed at a rate of 5 nsec 
per result, or 200 MFL0P5. 

The effective utilization of the computational tools build into 
the vector processor is closely related to the data structure, as are 
most of the important algorithmic decisions. It is, therefore, 
appropriate, at this point, to discuss the memory requi rements . A 
3x3 complex matrix is represented by 18 real numbers. The constraints 
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of being unitary and unimodular reduce the number of independent para- 
meters to 8, but such a minimal representation of the variables 
implies a substantial increase in the computational complexity. To obtain 
optimal performance it is useful to keep all the 18 values representing 
the real and imaginary parts of the elements of . For a lattice 
with n s = n t = 16 a configuration will be defined by 18 x 4 x 16 4 = 
4.718592 million values, which may be more than can be put in the fast 
memory of many computer systems. Fortunately, the sequential nature of 
the MC algorithm suggests that only a fraction of the variables need 
to be in memory at any one time. The others can be kept on disk. 

The factors which determine an optimal size for the partition between 
variables in memory and on disk are the following: 

i) the partition should not make the code unnecessari ly complicated; 

ii) the I/O operations should not take longer than the actual computations 
i i i Sufficiently long vectors should be available. 

On the basis of the above requirements we decided to upgrade one 

space at a time, i.e. to upgrade all the 4n^ variables with fixed 

s x 

time coordinate x 4 , and then to proceed to the next x^ etc. We 
shall refer to this procedure as time-slicing and to the collection of 
variables with fixed time coordinate x^ as one time-slice of the system. 
If the variables with a given x^ = t are being upgraded, the 
calculation of the force requires knowledge of the with x^ = t-1 , 
x^ = t and x^ = t+1 . Thus 3 time slices need to be in memory 

throughout this stage of the calculation. As a matter of fact, since 
I/O operations can proceed independently from CPU operations, it 
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is possible to achieve concurrency of I/O and CPU operations if 
extra memory buffer space is allocated for holding the x^ = t-2 slice 
(to be written out), and the = t+2 slice (to be read in). The 
conventional way of implementing concurrent I/O is to allocate space 
for two more slices. The resulting five slices in memory act as a 
circular buffer as shown in Fig. 2. However, the virtual memory hardware 
on the CDC CYBER 205, and the supporting software provide the capability 
to swap data between disk and memory. Hence, the memory area of one 
slice only is needed to write out the = t-2 slice, and read in 
the x^ = t+2 slice. Consequently , the total memory requirements for 
the link variables are thus 4 x n^ x 4 * 18 locations. Allowing for some 
additional work-space we find that lattices with n $ = 16 can be 
considered in a machine with 2m words (16m bytes) in full precision 
(64-bit words) and n s - 20 in half precision (32-bit words). The length 
in time does not constitute a problem any longer and lattices with any n^ 
may be simulated. 

With the slicing mechanism in place we now turn to vectorization 
aspects of the code. In Sec. 2, the Red-Black ordering was introduced. 

The motivation for this choice merits some discussion. The computation 
involves, mainly, matrix multiplications. This operation is easily 
vectorized, but the matrices concerned are 3x3 matrices, and the resulting 
vectors are going to be 3 elements long. For efficiently vectorized code 
one needs to seek longer vectors. This results from the observation 
that the timing formula for a vector instruction may be written as 

(Start-up + a-N) cycles (3.1) 
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where the start-up time is a constant, independent of the vector length. 

It amounts to aligning the input and output streams, filling up the 
pipelines up to the point where the first result is available and storing 
the last result. The start-up time is also independent of the number of 
pipelines and whether 64-bit or 32-bit arithmetic is performed. On the 
CDC CYBER 205 it amounts to about 50 cycles, or 1 usee. The "c^N" term 
is known as the "stream time". N is the number of elements in the vector, 
so that the stream time is proportional to the vector length, a is a 
constant associated with the number of pipelines and the arithmetic mode. 
Table 3.1 contains the a values for some relevant circumstances. 

It is now obvious that high performance is achieved by minimizing the number 
of "start-ups" as a consequence of using longer vectors, or, increasing 
N for each vector operation. 

The SU(3) matrices are too small as an object for vectorization ; 
however, there are n^ such matrices in every time slice. One cannot 
use all of these link values simultaneously because - 

i) updating each link requires all its immediate neighbors, and 

ii) the correct convergence of the Metropolis process depends upon 
using "new" values as soon as they are available. 

The Red-Black (checker-board) ordering resolves this apparent recursive 
relationship. The separation of the variables into two sets, for 
each value of y and at fixed , is achieved by putting in the two 
sets all the variables belonging to links originating from odd and even 
sites, respectively, i.e. with + x^ + = 1 or 0 (mod 2). This 

assures the independence of the forces from the variables ujj being 
upgraded. On a lattice with n s = 16 the above separation gives a vector 

■3 

length of n^/2 = 2048, sufficiently large to insure almost optimal 
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performance (in fact, 91% and 95% in 32-bit and 64-bit arithmetic, 
respectively). The calculation of the force requires knowledge 
of the variables associated with links neighboring the one under 

consideration. Because of boundary conditions, which we take to be periodic, 
the variables which enter the calculation of F^ will not, in general, 
have a simple location-index relative to in the array of dimension 

n^/2 . This is easily remedied by the introduction of auxiliary 

i nteger-val ued arrays, where the indices of the various neighbors of 
are prestored. The Gather instruction plays a crucial role in the 
way these index arrays are used. When F^ is evaluated, all the 
needed variables are gathered into temporary arrays, so that the indices 
of all elements entering into the computation of F^ are the same, and 
this proceeds in a fully vectorized manner. 

Once. the F^'s are determined the algorithm for the upgrading 

of all the UjJ (in the same set) is strai ghtforward and completely 

vectorizabl e. The matrices R which are used for finding the new 

candidates , are Gathered according to an array of indices extracted 

at random from a table. The table contains M SU(3) matrices which have 

a distribution centered around the identity of the group and are obtained 

in the following fashion. For each value of i between 1 and M/2 

(M must be even) an eight component vector with approximately 

2 

gaussian distribution and <V^> = 1 is pseodoramdomly generated. The 
fourth-order approximation to R. is given by 



+ iA - ~- 



Z exp(iA) , 


(3.2) 
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where 


a = b r v k x k 


(3.3) 


are Gell-Mann's matrices (i.e., a set of generators of the Lie 
algebra of SU(3)) and b is a real parameter specifying the spread 
of the distribution. The final value for R- is obtained by normalizing 
R? to a uni tary-unimodular matrix. In general, if we denote the 
three columns of an SU(3) matrix by r ]* r 2 and r^ the constraint 
of being unitary and unimodular is expressed by 


and 





= 0 


< r l* r 2 ) * 


(3.4) 


0 ->n _»o 

Given a matrix R with the first two columns r-| and r ^ > with 

ft 

r ] xr 2 t 0 » we shall define as the normalized form of R the matrix 

R with columns 






r Z ~ {r 2 " r l ^ r l >' 2 


(3.5) 


and 


r r r 2 


* 
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The reason for the nonmencl ature is due to the fact that, if R^ 1 differs 
slightly from a uni tary-unimodul ar matrix, e.g. as a consequence of 
roundoff errors, then R is an SU ( 3 ) matrix close in value to R^ 

Thus, the approximately unitary-unimodular matrix R? obtained by 
truncated exponentiation in Eqn. 3.2 is converted to a proper SU ( 3 ) 
matrix R^ by normalization. The last M/2 matrices are obtained by 

R m = R? (1 < 1 < ”) (3.6) 

T ' 1 

so as to insure that, together with any given matrix R.. , the inverse 

should also belong to the table. 

The procedure for normalizing the SU(3) matrices of the random 
table, as described above, is also applied, every few iterations, to the 
link matrices. This is done to insure that the group symmetry of the 
matrices is preserved regardless of rounding errors which may be 
introduced by the hardware after many arithmetic operations. This 
renormal ization process is particularly important when the computations 
are performed using low precision arithmetic. It gives us confidence, which 
was also tested and verified, in using 32-bit arithmetic for our calcul- 
ations on the CDC CYBER 205. 

Once is determined, using the table of random SU ( 3 ) 

matrices, the action difference is obtained by calculating, separately, 

ReTr( rj/ujj) and ReTr( ) 

(notice that ReTr(A'B) is the vector product of the arrays containing 
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the real and imaginary parts of A and B) » forming an array with 
exp(-AS) , comparing with an array of pseudorandom numbers and 
accepting or rejecting the change, via a masking operation, according 
to the outcome of the vectorized comparison between the random numbers 
and the exponentiated action differences. These steps are repeated 
for a prefixed number of hits before commencing the upgrade of the other 
set or the variables corresponding to different directions. 

The conditional acceptance of elements in a vector, or, the masking 
operation referred to above, is handled through the usage of a "bit-vector" 
(the CDC CYBER 205 is bit addressable and the software allows the Fortran 
user to use this feature). It is exploited as a part of the vector 
instruction, and inhibits storing results where zeros are encountered 
in the bit-vector. 

The reader should by now realize that many thousands of random 
numbers are required for each iteration. The conventional congruent 
method for generating random numbers is recursive, and may be described 
by 


y i+1 = (a-y-)mod(b) (3.7) 

where a is the "multiplier" and b is determined so as y. + i will 
be approximately the lower half of the coefficient of the product a - y^. 
The nature of this calculation suggests that in order to produce N 
random numbers one has to repeat it serially N times. There is, 
however, a way to reproduce the same sequence of N numbers in 
parallel, using vector instructions [7]. Define a new multiplier by 
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i^hi mi 'ilium 


A = (a N )mod(b) 

= ( . . . (a*a)mod(b)*a)mod(b) . . ,*a)mod(b) (3.8) 

and let 

—1 = ^1 *^2 * * * * (3*9) 

be the vector containing the first N random numbers. 

Then 

Y i+1 = (A*Y i )mod(b) (3.10) 

reproduces the same sequence of random numbers one gets with a 
repeated application of Eqn. 3.7 (the computation of Eqn. 3.10 requires 
only 3 vector operations on the CDC CYBER 205). 

To conclude this section, let us discuss the way matrix multiplication 
is done, being the most time-consuming aspect of the computation . First, 
the reader will remember that we do not vectorize the matrix multiplication 
as such, but, rather, perform the operations on many matrices in parallel, 
where for each matrix the "scalar" sequence of operations is followed. 

When computing the products of two SU ( 3 ) matrices, one need not 
evaluate all the columns of the result, since the third column of the 
product matrix (which is again uni tary-uni modular) is related to the 
first two by Eqn. 3.4. In the code we have exploited this fact whenever 
possible. It is particularly advantageous when several SU ( 3 ) matrices 
must be multiplied together, since one may limit the calculations to 
two columns out of three in all intermediate products and simply 
reconstruct the third column of the final result as shown in Eqn. 3.4. 
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Finally, all complex arithmetic has been done in terms of real 
variables, separating real and imaginary parts (which would also 
result in a more efficient code for a scalar machine), and we have 
used the identity 

(A+iB) (C+i D) * (A+B) (C-D)-BC + AD + i(BC+AD) (3.11) 

to perform the product of two complex matrices in terms of three real 

multiplications and five real matrix additions. Using complex 

arithmetic the product of two matrices would require four real 

multiplications and two additions. Due to the fact that matrix 

2 

multiplication requires 2N operations, where N is the dimension 

2 

of the matrix, and matrix addition requires only N operations, our 
method pays off even for N = 3 . 

A schematic outline of the flow of the calculations is shown 
in Fig. 3 . 

4. Performance and Timings 

The figures quoted here are based on runs executed on a two-pipe, 

2m 64-bit words CDC CYBER 205. They apply to a 16^ lattice (n $ = 16, = 16), 

SU(3) gauge theory with 10 hits per link upgrade (unless stated explicitly 
otherwise). We present performance figures for both 64-bit and 32-bit 
arithmetic operations. In both modes the exponentiation and the generation 
of random numbers were carried out using 64-bit arithmetic. It should be 
noted here that due to our slicing mechanism the 32-bit version requires 
real memory of only 852,000 words (64-bit words, or 6.8m bytes), so it 
actually fits comfortably on a 1m words system. With these parameters 
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the code performs at 98% CPU utilization. The 64-bit version requires, 
of course, twice as much memory. 

In Table 4.1 we give the percentage of the execution time for the 
two arithmetic modes spent in the force (Fjj) and the Metropolis 
updating calculations. It becomes clear from these figures why it is 
worth while using a single force computation for a number of attempts at 
updating (rather than the one attempt proposed by the original Metropolis 
method) . 

It should be added here the normalization procedure discussed in 
Sec. 3, performed every 5 iterations adds only 0.74% and 0.59% in 
64-bit and 32-bit modes, respectively, to the total execution time. 

Table 4.2 presents a percentage breakdown of the code by operation 
type. The reader will notice that the Gather, random number generation 
and the exponentiation operations are more heavily weighted in the 32-bit 
mode compared with that of the 64-bit mode. These three types of 
operations perform at the same rate in both modes. The last two 
execute in 64-bit mode in both versions of the code. The Gather instruction 
performs at the same rate regardless of whether the operands are 64-bit 
or 32-bit variables. This is because the performance of the Gather 
operation is driven by memory access (and not by computation complexity). 

The matrix multiplication, being made up of floating-point operations 
only, executes at near peak rate of 95 MFIOPS and 182 MFLOPS for the 
64-bit and 32-bit modes .respectively. The effect of vectorizing the 
random number generator can be illustrated by noting that this operation 
amounted to 6% (64-bit) and 11% (32-bit) of the total time when it was 
not vectorized. The "action" involves taking the real part of the 
trace of products of SU(3) matrices (purely floating-point operations). 

The "acceptance" is the portion of the code where the conditional acceptance 
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of new matrices occurs under the control of a bit-vector 
created for that purpose. 

The actual time for one iteration of the 16^ lattice 
with 10 hits is 16.27 secs. (64-bit) and 10.72 secs. (32-bit). This 
amounts to a substained performance rate of 66.8 MFLOPS (64-bit) and 
101.5 MFLOPS (32-bit). Another way, commonly used by physicists, to 
express the performance of Monte Carlo lattice gauge theories implemented 
on a computer system, is the link update time, i.e., the time needed 
to update one link of the lattice once. This measure is useful for 
comparisons since it is independent of the lattice size. The link 
update times (in ysecs.) for our implementation are given in Table 4.3. 
These figures may be compared to a link update time of about 1,100 
ysecs on the CDC 7600 computer system with a highly optimized code. 
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Table 3.1. Stream rate proportionality factor (ct) . 



Table 4.1. Breakdown by percentage of sections 
of code. 



64- bi t 

32-bit 

1 

force 

| 43.49 

1 42.46 

i 

f 

joaate 

‘ 55.40 

57.40 

i 


Table 4.2. Breakdown by percentage of the main operation 
types. 


operation type 

64-bit 

32-bit 

1 

matrix multiplication 


58.33 

47.05 

Gather 


20.78 

29.27 

random number generator 

0.95 

1 .83 

exponentiation 


7.43 

11.72 

action 


5.93 

4.70 

acceptance 


3.62 



3.01 

Table 4.3. The upgrades 

times for a link (in usees). 

number of hits 

‘64-bit 

32-bit 

10 


62.1 

40.9 

8 


55.1 

36.3 
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Figure 1. "Forward" (upper half) and 
"backward" (lower half) plaquettes in the 
y-v plane, where x = ( x i fX 2 • x 3 » x ^) is a 

point in our four-dimensional lattice. 

This is one out of three such planes which 
can be formed in a four-dimensional space. 
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FLOW CHART 



Figure 3 . Schematic description of the computational 

process . 
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